The purpose of this project is to demonstrate the pandas library data manipulation operations. In order to do that, we'll use San Francisco dataset available on (https://www.kaggle.com/kaggle/sf-salaries). This dataset contains annual data of employees from 2011 to 2014.
Importing package
In [3]:
import pandas as pd
Read CSV into dataframe 'sal'
In [4]:
sal = pd.read_csv('Salaries.csv')
Head of dataframe
In [5]:
sal.head()
Out[5]:
Finding some basic info about the data through the info() function
In [6]:
sal.info()
Average BasePay
In [7]:
sal['BasePay'].mean()
Out[7]:
Highest amount of overtime pay in the dataset
In [8]:
sal['OvertimePay'].max()
Out[8]:
What is the job title of JOSEPH DRISCOLL?
In [9]:
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']
Out[9]:
How much does JOSEPH DRISCOLL make (including benefits)?
In [10]:
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']
Out[10]:
The name of highest paid person (including benefits)?
In [11]:
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]
Out[11]:
The name of lowest paid person
In [12]:
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].min()]
Out[12]:
The average (mean) BasePay of all employees per year? (2011-2014)?
In [13]:
sal.groupby('Year').mean()['BasePay']
Out[13]:
Number of unique job titles
In [14]:
sal['JobTitle'].nunique()
Out[14]:
Top 5 most common jobs
In [15]:
grouped = sal.groupby('JobTitle').count()
top5 = grouped.sort_values(by='Id', ascending=False)[:5]
top5['Id']
Out[15]:
Job titles represented by only one person
In [16]:
copied_sal = sal[sal['Year'] == 2013]
group = copied_sal.groupby('JobTitle').count()
count = group[group['Id'] == 1]
count.count()['Id']
Out[16]:
How many people have the word Chief in their job title?
In [17]:
def find_chief(job_title):
if 'chief' in job_title.lower().split():
return True
else:
return False
sal = pd.read_csv('Salaries.csv')
sum(sal['JobTitle'].apply(lambda x: find_chief(x)))
Out[17]:
Is there a correlation between length of the Job Title string and Salary?
In [18]:
sal['title_len'] = sal['JobTitle'].apply(len)
sal[['title_len', 'TotalPayBenefits']].corr()
Out[18]: